Reliability of detection of ultrasound and MRI features of hand osteoarthritis: a systematic review and meta-analysis

Abstract Objectives To systematically review the literature on inter- and intra-rater reliability of scoring US and MRI changes in hand OA. Methods MEDLINE, EMBASE, CINAHL, Web of Science and AMED were searched from inception to January 2020. Kappa (κ), weighted kappa (κw) and intra-class correlation coefficients for dichotomous, semi-quantitative and summated scores, respectively, and their 95% CI were pooled using a random-effects model. Heterogeneity between studies was assessed and reliability estimates were interpreted using the Landis–Koch classification. Results Fifty studies met the inclusion criteria (29 US, 17 MRI, 4 involving both modalities). The pooled κ (95% CI) for inter-rater reliability was substantial for US-detected osteophytes [0.66 (0.54, 0.79)], grey-scale synovitis [0.64 (0.32, 0.97)] and power Doppler [0.76 (0.47, 1.05)], whereas intra-rater reliability was almost perfect for osteophytes [0.82 (0.80, 0.84)], central bone erosions (CBEs) [0.83 (0.78, 0.89)] and effusion [0.83 (0.74, 0.91)], and substantial for grey-scale synovitis [0.64 (0.49, 0.79)] and power Doppler [0.70 (0.59, 0.80)]. Inter-rater reliability for dichotomous assessment was substantial for MRI-detected CBEs [0.75 (0.67, 0.83)] and synovitis [0.69 (0.51, 0.87)], slight for osteophytes [0.14 (0.04, 0.25)], and almost perfect for sum score of osteophytes, CBEs, joint space narrowing (JSN), and bone marrow lesions (BMLs) (0.81–0.89). Intra-rater reliability was almost perfect for sum score of MRI synovitis [0.92 (0.87, 0.96)], BMLs [0.88 (0.78, 0.98)], osteophytes [0.86 (0.74, 0.98)], CBEs [0.83 (0.66, 1.00)] and JSN [0.91 (0.87, 0.91)]. Conclusion US and MRI are reliable in detecting hand OA features. US may be preferred due to low cost and increasing availability.


Introduction
Symptomatic hand OA is common among community-dwelling adults, and its prevalence increases with age [1,2]. People with hand OA often experience pain, stiffness and impaired function [3–5]. Just as for other forms of arthritis, imaging is central to understanding the disease course, outcome and pathophysiology of hand OA. EULAR recommends imaging in hand OA if there is an unexpected rapid progression of symptoms or change in clinical characteristics, with plain radiographs as the first line of imaging investigation [6]. However, plain radiographs are limited by inability to visualize synovial changes that are apparent on other imaging modalities such as US and MRI, as well as osseous changes, e.g. bone marrow lesions (BMLs), that are assessable on MRI [7].
In the past two decades, US and MRI have been used extensively to assess hand OA. US provides an inexpensive, safe and non-invasive means of assessing changes such as joint effusion, grey-scale synovitis (GSS), hyper-vascularity, osteophytes and erosions [8,9], while MRI has the additional advantage of demonstrating BMLs [10]. However, MRI is relatively expensive, and most MRI coils have a limited field of view and can only image the second to fifth distal and proximal interphalangeal joints (DIPJs and PIPJs) of one hand at a time.
Several methods to score changes in hand OA have been developed for both US [11] and MRI [12,13], and varying levels of reliability have been summarized for US in a previous narrative systematic review [14]. However, the reliability of assessment of features of hand OA using MRI has not been systematically reviewed, and the reliability of these two imaging modalities in detecting hand OA changes has not been compared. Therefore, we aimed to systematically review the intra- and inter-rater reliability of US and MRI in detecting changes of hand OA.
Supplementary Table S1, available at Rheumatology online. The search protocol was registered in PROSPERO (CRD42018095677).

Inclusion and exclusion criteria
Studies were selected if they utilized US or MRI to investigate hand OA and reported inter- or intra-rater reliability. Studies investigating people with other forms of arthritis, e.g. rheumatoid or psoriatic arthritis, and non-human studies were excluded. Conference abstracts were excluded since they contain insufficient data for the purposes of a systematic review. No language restrictions were applied in the search.

Data extraction and outcome measures
Information extracted included publication year, country, diagnostic criteria, study design, method of selecting joints or participants for reliability assessment, imaging method(s), joints assessed, scoring method(s), training background of assessor(s), and reliability measures such as kappa coefficient [weighted (κw) and unweighted (κ)], intra-class correlation coefficient (ICC) and 95% CI. US and MRI features examined included osteophytes, CBE, JSN, effusion and synovitis. PD changes and MRI-detected BMLs were also examined. Where multiple publications reported reliability estimates using the same assessor(s), either the earliest publication or the publication with the most comprehensive data was included in this review. For publications providing reliability results for multiple assessors separately, each assessor's reliability result was considered as a separate data point. Where a publication used multiple assessors and reported average reliability coefficients for two sessions, data from only the first session were included if they used the same set of assessors for both sessions, or both sessions were included as separate data points if they did not use exactly the same set of assessors. For studies reporting reliability coefficients without 95% CIs, we utilized a meta-analysis effect size calculator to estimate 95% CIs given sample size and correlation coefficient [15].
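As an illustration of how a 95% CI can be approximated from a coefficient and a sample size alone, the sketch below applies the Fisher z-transformation. This is an assumption made for illustration: the formula used by the calculator cited above [15] is not stated here, and the function name `fisher_z_ci` and the example values are hypothetical.

```python
import math


def fisher_z_ci(r, n, z_crit=1.96):
    """Approximate CI for a correlation-type coefficient r observed on n units.

    Assumption: Fisher z-transformation with standard error 1 / sqrt(n - 3).
    The calculator cited in the text may use a different formula.
    """
    if not -1.0 < r < 1.0:
        raise ValueError("r must lie strictly between -1 and 1")
    if n <= 3:
        raise ValueError("n must be greater than 3")
    z = math.atanh(r)               # transform r to the z scale
    se = 1.0 / math.sqrt(n - 3)     # standard error on the z scale
    lower = math.tanh(z - z_crit * se)  # back-transform the limits
    upper = math.tanh(z + z_crit * se)
    return lower, upper


# Hypothetical study: coefficient 0.80 measured on 50 joints
lower, upper = fisher_z_ci(0.80, 50)
print(f"95% CI: ({lower:.2f}, {upper:.2f})")
```

Because the limits are back-transformed from the z scale, the interval is asymmetric around the coefficient and cannot exceed 1.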

Quality assessment
The Newcastle-Ottawa scale (NOS) for case-control and cohort studies [16] and a modified NOS for cross-sectional studies [17] were used for quality assessment. This uses a star system, ranging from 0-9 stars for case-control and cohort studies, and 0-10 stars for cross-sectional studies. A high number of stars denotes good quality.

Validation methods
The screening of titles, abstracts and full-texts, data extraction and risk of bias assessment were performed by one reviewer (A.D.O.). Two second reviewers (J.K. and S.S.), already trained in systematic review methods, independently repeated the assessments on a randomly selected sample for validation. J.K. screened titles and abstracts of 100 citations. S.S. screened full texts, assessed risk of bias and extracted data on 10% (n = 18), 20% (n = 6) and 10% (n = 5) of eligible articles, respectively. Discrepancies were discussed and resolved with the senior author (A.A.).

Study selection
Our search identified 6095 citations, of which 183 were selected for full-text review after screening of titles and abstracts. Fifty-two studies met the inclusion criteria. However, two studies were later excluded because one performed reliability assessment on fusion of US and MRI images (fusion imaging) [22] and the other performed reliability assessment on finger, knee, hip and ankle joints, and did not provide separate estimates for individual joints [23]. This left 50 studies for inclusion in the final analysis (29 used US only, 17 used MRI only and four involved both imaging modalities). The literature search and screening flowchart is presented in Fig. 1. Agreements between A.D.O. and the second reviewers for screening procedures and risk of bias assessment were all excellent.

Study characteristics
Fifty articles published between 2005 and 2020 were included in this review, consisting of 240 (3654 joints) and 130 (932 joints) participants for US and MRI assessments, respectively. The majority of studies (n = 44) recruited participants from specialist hospital clinics and used the ACR and/or plain radiographic criteria (n = 38). Participants or images used for reliability assessment were selected randomly in 15 studies, serially in 11 studies, and selected to represent disease severity in four studies, while this was unclear in 20 studies. The majority of studies utilized the outcome measures in rheumatology (OMERACT) definitions for US (22 of 33) and MRI (13 of 21) pathologies. The US probes used across studies had a minimum frequency of 5 MHz and a maximum frequency of 22 MHz, with images acquired using a frequency range of 11-18 MHz in most of the studies (31 of 33). One study acquired images using a frequency of 22 MHz [24], while this was unclear in one study [25]. MRI scanner strengths ranged from 1.0 to 3.0 T and the majority of MRI studies (18 of 21) performed assessment of synovitis on contrast-enhanced scans (Supplementary Table S2, available at Rheumatology online). The median quality scores for risk of bias were 8 (0-9 scale) for cohort, 6 (0-9 scale) for case-control and 6 (0-10 scale) for cross-sectional studies (Supplementary Table S3, available at Rheumatology online).

Inter-rater reliability of assessment of US features of hand OA
The 14 US studies that reported inter-rater reliability [11,25–37] provided data for osteophytes (n = 8), JSN (n = 2), CBE (n = 2), effusion (n = 2), GSS (n = 11) and PD (n = 7). The pooled κ (95% CI) was substantial for [...] [11,34,37]. Heterogeneity between studies was considerable in these analyses. A significant publication bias was present for only PD (Table 1). Reliability data were not pooled for CBE and effusion due to insufficient data reporting, use of variable outcome measures and use of the same assessors in multiple studies. However, inter-rater agreement ranged from substantial to almost perfect for effusion and CBEs [34,35]. Inter-rater reliability estimates from individual studies are presented in Supplementary [...]
[...] [38,43]. Heterogeneity between studies was considerable in these analyses, except for dichotomous assessment of osteophytes and sum score of PD, for which it was unimportant (Table 2). Analysis for intra-rater reliability for assessment of JSN was not performed due to insufficient data and heterogeneity in the reporting pattern. For example, of five studies that reported this, one reported 100% agreement for DIPJs and PIPJs [51], two reported substantial agreement (κ = 0.64) using the same cohort and assessors [39,40], and two reported a κ range for cartilage abnormalities [27,30]. Details are available in Supplementary Table S5, available at Rheumatology online.
Inter-rater reliability for assessment of MRI synovitis and BMLs was numerically better for raters with a training [...]. In these analyses, heterogeneity was unimportant to moderate for rheumatology-trained assessors and moderate to considerable for non-rheumatology-trained assessors. The OMERACT scoring method produced a higher level of reliability for BML and synovitis, with non-significant heterogeneity, than the Oslo scoring method. Additionally, reliability assessment of synovitis, osteophytes and CBE was better for assessments performed on images acquired with 1.0 T than 1.5 T scanners (Table 4).
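The pooling, heterogeneity and interpretation steps referred to in these results can be sketched as follows. This is a minimal illustration under stated assumptions, not the analysis code used in this review: it assumes the DerSimonian-Laird estimator for the random-effects model, recovers standard errors from reported 95% CIs, expresses heterogeneity as I², and uses the conventional Landis–Koch cut-offs. All function names and input values are hypothetical.

```python
import math


def se_from_ci(lower, upper, z_crit=1.96):
    """Standard error recovered from a symmetric 95% CI."""
    return (upper - lower) / (2 * z_crit)


def pool_random_effects(estimates, ses, z_crit=1.96):
    """DerSimonian-Laird random-effects pooling (an assumed estimator)."""
    k = len(estimates)
    w = [1.0 / se ** 2 for se in ses]                 # fixed-effect weights
    sw = sum(w)
    fixed = sum(wi * yi for wi, yi in zip(w, estimates)) / sw
    q = sum(wi * (yi - fixed) ** 2 for wi, yi in zip(w, estimates))  # Cochran Q
    df = k - 1
    c = sw - sum(wi ** 2 for wi in w) / sw
    tau2 = max(0.0, (q - df) / c) if c > 0 else 0.0   # between-study variance
    w_re = [1.0 / (se ** 2 + tau2) for se in ses]     # random-effects weights
    pooled = sum(wi * yi for wi, yi in zip(w_re, estimates)) / sum(w_re)
    se_pooled = math.sqrt(1.0 / sum(w_re))
    i2 = max(0.0, (q - df) / q) * 100 if q > 0 else 0.0
    return {
        "pooled": pooled,
        "ci": (pooled - z_crit * se_pooled, pooled + z_crit * se_pooled),
        "tau2": tau2,
        "i2": i2,
    }


def landis_koch(kappa):
    """Landis-Koch label for a kappa value (conventional cut-offs)."""
    if kappa < 0:
        return "poor"
    if kappa <= 0.20:
        return "slight"
    if kappa <= 0.40:
        return "fair"
    if kappa <= 0.60:
        return "moderate"
    if kappa <= 0.80:
        return "substantial"
    return "almost perfect"


# Hypothetical kappa values and standard errors from three studies
result = pool_random_effects([0.5, 0.7, 0.9], [0.05, 0.05, 0.05])
print(result, landis_koch(result["pooled"]))
```

With these hypothetical inputs the upper confidence limit stays below 1, but because the interval is computed on the untransformed scale it can exceed 1 when estimates are high and heterogeneous.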

Sensitivity analysis
All US studies either assessed reliability on the same stored image or on real-time scans repeated on the same day, with the notable exception of two studies that assessed reliability on real-time scans repeated after 1 week [49] and 12 weeks [50]. Data from these two studies were excluded in a sensitivity analysis for inflammatory features. There was no significant difference with the two studies excluded, except for GSS, where reliability reduced from substantial to moderate (Supplementary Table S9, available at Rheumatology online). Furthermore, contrast enhancement was used in all MRI studies involved in the analysis for synovitis, except one study [48]. There was no observable difference in the results when this study was excluded from the analysis (Supplementary Table S10, available at Rheumatology online).

Discussion
This is the first comprehensive systematic review and meta-analysis of the reliability of US and MRI in detecting features of hand OA. The key findings of this study are: (i) agreement was moderate to almost perfect for US; and (ii) agreement was slight to almost perfect for MRI features of hand OA. Our findings for inter- and intra-rater reliability of assessment of US features were consistent with those reported in a previous meta-analysis for knee OA [14].
Generally, intra-rater reliability was higher than inter-rater reliability for both US and MRI. Reliability was numerically lower for US than MRI, particularly for synovitis. However, inter-rater agreement for binary scoring of osteophytes was slight for MRI but substantial for US. This finding should be interpreted with care since only two studies were involved in the pooled estimate for MRI-detected osteophytes. US is widely perceived as an operator-dependent technique, and several factors such as probe positioning, acquisition of images in real-time, and interpretation of the acquired images affect reliability [69,70]. This may explain why reliability for PD signal, GSS and osteophytes was better for US imaging studies that used static images than for those that used real-time scans (Supplementary Table S8, available at Rheumatology online).
Reliability was comparable when dichotomous and semi-quantitative scores were used. However, reliability was best for summated scores, particularly for MRI-assessed JSN, osteophytes and CBE. Reliability assessment of imaging features using a summated score could potentially be overoptimistic, since it is based on the sum of grades of a pathology in the whole hand without accounting for the individual joints that are affected. For example, if rater A scores as grade 2 a pathology in the second to fifth PIPJs in one participant and rater B scores as grade 2 a pathology in the second to fifth DIPJs of the same participant, agreement between the two raters will be almost perfect for summated score but poor for dichotomous and semi-quantitative assessments.
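The rater A and rater B example above can be worked through numerically. The sketch below encodes the eight joints described, compares the summated scores, and computes an unweighted Cohen's κ at joint level. The helper `cohen_kappa` and the assumption that all unscored joints are grade 0 are illustrative choices, not part of the review's methods.

```python
def cohen_kappa(a, b):
    """Unweighted Cohen's kappa for two raters scoring the same items."""
    n = len(a)
    categories = set(a) | set(b)
    p_obs = sum(x == y for x, y in zip(a, b)) / n          # observed agreement
    p_exp = sum((a.count(c) / n) * (b.count(c) / n)        # chance agreement
                for c in categories)
    if p_exp == 1.0:
        return float("nan")  # kappa is undefined when chance agreement is 1
    return (p_obs - p_exp) / (1 - p_exp)


# Joint order: PIPJ 2-5 followed by DIPJ 2-5 of one participant.
# Assumption: joints not mentioned in the example are scored grade 0.
rater_a = [2, 2, 2, 2, 0, 0, 0, 0]   # grade 2 in the PIPJs only
rater_b = [0, 0, 0, 0, 2, 2, 2, 2]   # grade 2 in the DIPJs only

sum_a, sum_b = sum(rater_a), sum(rater_b)
kappa_grades = cohen_kappa(rater_a, rater_b)

# Dichotomous version: pathology present (1) or absent (0) per joint
kappa_binary = cohen_kappa([int(g > 0) for g in rater_a],
                           [int(g > 0) for g in rater_b])

print(sum_a, sum_b)                  # identical summated scores
print(kappa_grades, kappa_binary)    # joint-level agreement
```

Both raters arrive at a summated score of 8, so they agree perfectly on this participant's sum score, while the joint-level κ is -1 because they disagree on every joint.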
The frequency used for B-mode scans across studies ranged from 11 to 18 MHz, except for one study that used a high-resolution probe with frequency up to 22 MHz [24]. Reliability estimates for osteophytes and synovitis were better for assessment on images acquired with scan frequency 15 MHz than frequency 14 MHz. It is noteworthy that as scan frequency increases, spatial resolution increases but tissue penetration reduces [71,72], which makes high-frequency probes suitable for scanning joints that are superficial such as finger joints. Conversely, MRI scanner strength used across studies ranged from 1.0 to 3.0 T. Inter-rater reliability of detection of MRI synovitis, osteophytes and CBE was better for assessments performed on images acquired with 1.0 T than 1.5 T scanners. Of three studies that utilized 3.0 T scanners to examine osteophytes, inter-rater reliability was slight in two [35,58], but almost perfect in one study [59]. Both studies with slight agreement performed dichotomous assessment whereas the latter performed sum score assessment. It is noteworthy that the experience of the raters could also contribute to the variable reliability. Across the three studies, only one stated years of experience of the raters, which was 12-13 years [58]. Therefore, further studies are required to explore the impact of scanner quality on the reliability of detecting features of hand OA. Furthermore, subgroup analyses based on training background showed that reliability assessments of US and MRI features of hand OA were broadly comparable for rheumatology and imaging trained assessors. There were subtle numeric differences for some imaging features, but these were not significantly different. These findings suggest that rheumatology trained assessors may be sufficient to undertake research and clinical assessments using US or MRI in people with hand OA.
Several studies have adopted the OMERACT definitions [73] and semi-quantitative scoring methods for US pathologies in hand joints [74], which were originally developed for RA but adapted for assessment of hand OA. This has been criticized as it could contribute to a floor effect when assessing inflammatory changes [14] since inflammation is only present at a low level in hand OA [44], necessitating development of scoring systems tailored to hand OA. Keen and colleagues have developed a preliminary scoring system for US features of hand OA [11], which is gaining widespread usage. However, this is not accompanied by representative images for reference purposes, while a few atlases have been developed for scoring of osteophytes [31] and cartilage damage [27].
There are potential limitations to this review. Firstly, we focused only on hand OA. Therefore, findings are not generalizable to OA in other joints. However, our findings are consistent with those of a previous meta-analysis on knee OA [14]. Secondly, there was significant heterogeneity across studies included in the analyses. Therefore, caution should be applied when interpreting findings from this review. Nevertheless, this review highlights the reliability of US and MRI in detecting features of hand OA. Additionally, it highlights a lack of representative imaging atlases devised for most US features of hand OA.
In conclusion, both US and MRI are reliable in detecting hand OA changes. However, further standardization of techniques and development of representative atlases for all imaging features of hand OA are essential.